Webpage: https://v993.github.io/Representative-Polarity-US-House/

drawing

Problem Overview:


U.S. politics has become increasingly polarized and complicated in recent decades, making polarization a major focal point of today's political science. For many, this is cause for concern, and not only because of its impact on Thanksgiving dinners. Pernicious polarization has been linked to democratic erosion (Carnegie), and as constituents of the fastest-polarizing country in the world (Brown), we have a responsibility to question what our representatives' motives are and whether they have our best interests at heart. So what are our responsibilities? Many of us understand the importance of voting, but how can we keep track of what decisions our representatives are making for us, and whether we should vote for them again down the line? As important as voting is in the U.S., many of us have better things to do than endlessly scrutinize the representative we voted for because they were blue or red in that election that felt like yesterday, and yet is somehow happening again, urging you out of your house and ruining your Tuesday afternoon.

Quantification of a representative on an ideological plane allows us to summarize who a representative is without studying their voting patterns or past policies. We can tell a lot about a representative by comparing them to their peers and contextualizing their ideology in relation to others. It would be a lot easier to say that a person is quantifiably -0.7 on an ideological scale compared to their colleagues than it would be to summarize the legislative decisions they have made.

Furthermore, it is by no means a new idea to conceptualize politicians in multidimensional space in order to understand their political ideology. Famously, Poole and Rosenthal estimate spatial coordinates for representatives using political choices: votes*. Poole and Rosenthal's DW-NOMINATE system (Dynamic Weighted NOMINAl Three-step Estimation) represents legislators on a two-dimensional map showing how similar their voting records are and, theoretically, how similar their political ideologies are. This also means that a representative's position in this space can only be computed for politicians we know through their voting records. New representatives who have never voted obviously cannot have a NOMINATE score.

Understanding a candidate's place in the context of their fellow representatives could be a useful tool for understanding where politicians lie (no pun intended) and how they can be quantified, prior to election. In this project, I hope to use data on political representatives and the districts they represent to quantify their political ideologies and predict NOMINATE scores without using voting records. The main goal of this project is to make large-scale politics more digestible. I've done some work on this in the past (New York Data Project), but never in a learning capacity (i.e., machine learning). My work will be centered on this website. I will focus specifically on House representatives: there are more representatives to train on, and I believe we will encounter more variance in the lower chamber of Congress.


The central questions which will guide this study are the following:

  • Are data on representatives alone adequate to predict NOMINATE scores?
  • What features of a representative have the largest impact on their ideology?
  • Can NOMINATE scores be used to predict features of a representative (e.g. age, aspects of constituency, finances)?


Citations:

  • Carnegie, “What Happens When Democracies Become Perniciously Polarized?
  • Brown, “U.S. is Polarizing Faster Than Other Democracies”
  • *Poole, Keith T., and Howard Rosenthal. “A Spatial Model for Legislative Roll Call Analysis.” American Journal of Political Science, vol. 29, no. 2, 1985, pp. 357–84. JSTOR
# load pretty jupyter
%load_ext pretty_jupyter
%%capture
%pip install numpy pandas thefuzz matplotlib geopandas seaborn scikit-learn  # importlib is part of the standard library and needs no install
import re
import numpy as np
import pandas as pd
from thefuzz import fuzz
from math import floor, ceil
import matplotlib.pyplot as plt
import fresh_data.get_datasets
import importlib
importlib.reload(fresh_data.get_datasets) # reload get_datasets every time this cell is run
from fresh_data.get_datasets import *

# plt.rcParams['axes.grid'] = True # Universal grids for plots
plt.rcParams.update({'font.size': 22}) # Universal font size for plots

# Set facecolor for plots, best for exporting
plt.rcParams['axes.facecolor']='white'
plt.rcParams['savefig.facecolor']='white'

Data/Resources:


In order to understand representatives' ideology without their voting records, we're going to need some good substitutes. To cover our bases, I've pulled data on each representative's constituents and as much data on the representatives themselves as I could find.

Representative Data:

The best sources of data on representatives that aren't vote-oriented are financial, as all representatives are required by law to report their campaign and personal finances. Two important sources of data here are OpenSecrets, a nonpartisan organization that tracks money in politics, and the FEC, from which some of the OpenSecrets data is collected.

  1. VoteView DW-NOMINATE scores of representatives in the US House of Representatives
  2. OpenSecrets data on lobbying, campaign finance, and personal finances for congressional representatives
  3. FEC campaign finance data for congressional representatives

State Demographics:

A primary concern (at least, it should be a primary concern) for representatives is their constituents. Constituents decide who holds public office, and this often attracts a particular kind of person to represent a particular district. Using demographic data on a representative's constituents can give us vital insight into the ideology of the representative. Furthermore, state data is easier to find than data on representatives, and there are a fair number of sources here, not all utilized in this study. We need the most comprehensive representation of a member's constituency possible, so it would be ideal to collect data at the district level (the unit members of the House represent). Sadly, good sources of district-level data are sparse. For the sake of this study, we will extrapolate state demographic information across a state's representatives. This approach is not ideal for reasons we will see later in our data exploration, but we will need to make do.

  1. Pew Research Center religious populations in each state, and survey questions on belief in God
  2. US Census decennial population and geodata per state
  3. KFF state demographics data including race and poverty statistics
  4. IRS data on SAIPE (Small Area Income and Poverty Estimates)
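As a minimal sketch of the extrapolation step described above (the column names and values below are hypothetical, not the project's real schema), merging a state-level table onto a representative-level table broadcasts each state's demographics to all of that state's representatives:

```python
import pandas as pd

# Hypothetical miniature tables illustrating the extrapolation step:
# state-level demographics are broadcast onto every representative
# from that state via a left merge.
reps = pd.DataFrame({
    "representative": ["A", "B", "C"],
    "state_name": ["Alabama", "Alabama", "Vermont"],
})
state_demo = pd.DataFrame({
    "state_name": ["Alabama", "Vermont"],
    "population": [4447100.0, 608827.0],
})

merged = reps.merge(state_demo, on="state_name", how="left")
# Both Alabama representatives now carry the same state-level value.
```

This is exactly why district-level variation is lost: every representative from a state receives identical demographic features.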

The range of data collected can hopefully give us a good idea of who the representative is, and we can filter down to more impactful features down the road if desired.

Extract, Transform, Load (ETL)

These sources have been neatly compiled into the following call to my data package. For more details, please see the data wrangling scripts in the GitHub for this study, linked here. The GitHub dives into the wrangling steps I took to form this dataset. The sources linked above are diverse, and loading them into my DataFrame took a combination of web scraping, file downloading en masse, and API calls. Entity resolution, not to mention data discrepancies, was no simple task. I'd estimate that only 20% of the work for this project is what you see in this notebook; the remainder was entirely ETL.

The get_df() function call you see below performs all data extraction, transformation, and loading for the utilized datasets. Due to the nature of my fuzzy matching algorithm, the runtime of this operation is around 1.5 minutes. Because of the duration of this operation, I have elected to save this file to "full_df.csv" (available in my Github). If desired, one could uncomment the ETL line of code below and use a freshly generated source, but for the purposes of this project, using the .csv is sufficient.
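The entity resolution here relies on thefuzz for fuzzy name matching. As a rough, stdlib-only sketch of the idea (the helper below is hypothetical and not the project's actual matcher), name similarity can be made tolerant of case and of the surname-first ordering used by some sources:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Rough similarity score in [0, 1], ignoring case and comma ordering."""
    def normalize(s: str) -> str:
        # Lowercase, drop commas, and sort tokens so "LAST, First" and
        # "First Last" normalize to the same string.
        return " ".join(sorted(s.lower().replace(",", " ").split()))
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# The same name in two different source formats scores highly...
score_match = name_similarity("DICKINSON, William Louis", "William Louis Dickinson")
# ...while unrelated names score much lower.
score_miss = name_similarity("DICKINSON, William Louis", "BEVILL, Tom")
```

Scoring every candidate pair like this is what makes the full ETL run take on the order of a minute.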

# ETL code present in data.py, which can be freshly generated using this code:
import data
import importlib
importlib.reload(data) # reload get_df every time this cell is run
from data import get_df

# df = get_df() # Get full DF (takes approximately 1 minute with current merging strategy):
# df.to_csv("full_df.csv", index=False)
# Instead of generating new data, we can use pre-generated data located in 'full_df.csv':
df = pd.read_csv("full_df.csv")
df.head(3)
representative state_name district_code party congress year_range born age nominate_dim1 nominate_dim2 ... hindu historically_black_protestant jehovahs_witness jewish mainline_protestant mormon muslim orthodox_christian unaffiliated_religious_nones population
0 DICKINSON, William Louis Alabama 2 Republican Party 101 1989-1991 1925.0 83.0 0.398 -0.057 ... 0.01 0.04 0.01 0.01 0.01 0.01 0.01 0.01 0.01 4447100.0
1 BEVILL, Tom Alabama 4 Democratic Party 101 1989-1991 1921.0 84.0 -0.213 0.976 ... 0.01 0.04 0.01 0.01 0.01 0.01 0.01 0.01 0.01 4447100.0
2 NICHOLS, William Flynt Alabama 3 Republican Party 101 1989-1991 1918.0 70.0 -0.042 0.872 ... 0.01 0.04 0.01 0.01 0.01 0.01 0.01 0.01 0.01 4447100.0

3 rows × 52 columns

Interesting Data Inconsistencies

  • Sander Levin is a representative from Michigan who at two points in his career represented two different districts in Michigan. From 1983-1993, he represented the 17th district, and did not receive any contributions (at least on file with the FEC). In 1993 he retired from the 17th district to campaign for and win the 12th district House seat for Michigan. This seat would later get redistricted to the 9th district in 2012. In 2017, he announced he would not run for reelection in 2018. Instead, his son, Andy Levin, became his successor as the representative for the 9th district.

  • Alaska, Wyoming, Montana, North Dakota, South Dakota, Vermont, and Delaware all have only one seat in the US House of Representatives. Different databases encode this single district number differently, as either 01 or 00. These needed to be manually recoded.


  • Redistricting:
    • Ron Barber is a representative from Arizona who represented the 2nd district. He took office in 2012, when the district was still numbered as the 8th district; it was redistricted in 2013. VoteView's polarization data notes this under the 8th district in the 112th session of congress (2011-2013), while the FEC records his campaign finance information for 2011-2012 under the 2nd district. This requires manual reconciliation.
    • Kathy Hochul is a representative from New York who represented the 26th district. She took office in 2011 and lost her reelection bid in 2012 after the district was redistricted into the 27th district. VoteView's polarization data notes this under the 26th district, yet the FEC again records her campaign finance information for 2011-2012 under the 27th district, which is inaccurate. This requires manual reconciliation.
    • Conor Lamb is a representative and attorney from Pennsylvania who represented the old 18th district and later the redrawn 17th district. Conor took office in 2018 after winning a special election in the 18th district, whose previous occupant, a pro-life Republican, resigned following a scandal involving his urging a mistress to have an abortion. Pennsylvania's court-ordered 2018 redistricting then made his old territory far more Republican, so Lamb, a Democrat, gave up his seat to run for and win the new 17th district.


  • Data discrepancies:
    • Year encodings make matching these datasets wildly difficult. The most confusing part of researching this part of the project was understanding electoral cycles (despite my being a political science major in undergrad). I quickly realized that matching based on year would not work for several reasons: not only are electoral cycles measured slightly differently across sources, but financial records for campaigns often encompass spending in the years leading up to an election and after. This resulted in the current implementation, which matches on name, district, and state, a time-consuming and inefficient method.
    • The FEC reports finances for candidates, while OpenSecrets and our polarization DB both include information for representatives. The FEC includes individuals who ran for office, even if they didn't hold office.
    • The FEC includes finances for representatives of non-voting territories like Guam or the Virgin Islands, which are part of the United States but have no voting representation in Congress. These have to be excluded from our study.
    • For whatever reason, the following representatives do not have campaign finance information for some of their campaigns on file with the FEC. The cause could be incorrect parsing (though the number of times I have checked suggests otherwise), or perhaps very, very cheap campaigns. Some of these could be incumbents who didn't spend money on campaigns or didn't have competition. Here is the full list:
      • LaTOURETTE, Steven C: [106, 107]
      • FORBES, J. Randy: [107]
      • ISAKSON, Johnny [107]
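The at-large recode mentioned above can be sketched as follows (a hypothetical helper for illustration, not the project's actual wrangling code):

```python
# States with a single House seat appear as district "00" in some sources and
# "01" in others; mapping both to one canonical code lets the sources join.
AT_LARGE_STATES = {
    "Alaska", "Wyoming", "Montana", "North Dakota",
    "South Dakota", "Vermont", "Delaware",
}

def canonical_district(state_name: str, district_code: str) -> str:
    """Collapse the two at-large encodings into a single canonical code."""
    if state_name in AT_LARGE_STATES and district_code in ("00", "01"):
        return "00"  # canonical at-large code
    return district_code
```

Multi-seat states pass through untouched, so only the seven at-large states are affected.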

Exploratory Data Analysis (EDA):


A Brief note on DW-NOMINATE

Per Poole and Rosenthal:

  • The first dimension picks up differences in ideology, represented through the "liberal" vs. "conservative" (also referred to as "left" vs. "right") divide throughout American history. Negative denotes a liberal disposition, positive a conservative one.
  • The second dimension picks up differences within the major political parties over slavery, currency, nativism, civil rights, and lifestyle issues during periods of American history.

For most purposes, the second dimension isn't as relevant. For the purposes of this study, we will focus on the first dimension. Using the 101st session from 1989 and our most recent 116th session of congress, we can see a little of what political science frequently (and unnecessarily) reminds us: America is becoming more polarized. The difference is pretty striking visually:

congress116 = df[df["congress"]==116]
congress101 = df[df["congress"]==101]

fig, (ax1, ax2) = plt.subplots(nrows=1,ncols=2,figsize=(18,8))

ax1.set_xlabel("nominate_dim1")
ax1.set_ylabel("Frequency")
ax1.hist(congress101["nominate_dim1"], alpha=0.5, label="101st session")
ax1.hist(congress116["nominate_dim1"], alpha=0.5, label="116th session")
ax1.legend(loc='upper right')
ax1.title.set_text("NOMINATE 1st Dim:");
ax1.grid()

all_years = sorted(df["year_range"].unique())  # sort chronologically for plotting
yearly_abs_polarity = [df[df["year_range"]==year]["nominate_dim1"].abs().mean() for year in all_years]
start_years = [int(year[:4]) for year in all_years]
ax2.plot(start_years, yearly_abs_polarity)
ax2.set_xlabel("Year")
ax2.set_ylabel("abs nominate_dim1")
ax2.title.set_text("Average Abs Polarity by Year")
ax2.grid()

fig.tight_layout()
# fig.savefig(f"nominate_scores_over_time.png", dpi=96)

This basic extraction shows us a pretty striking relationship. The histogram on the left shows the trend away from a uniform distribution, with both peaks (both ideological centers) becoming tighter and tighter clusters away from 0.0 (ideological moderate). The line plot on the right demonstrates the average absolute ideological polarity (across the aisle), which we can see also trends away from 0.0.

Age in the House

Before diving into ideological scores, let's investigate the age values in congress and see what our demographics are like. Considering the stereotypical polarity of older generations in the U.S., examining the age distribution in congress could be useful.

print("Average age between 1989-2020:", df["age"].mean())
print("Median age between 1989-2020:", df["age"].median())
print("Max age in between 1989-2020:", df["age"].max())
Average age between 1989-2020: 58.08426095533324
Median age between 1989-2020: 56.0
Max age in between 1989-2020: 101.0

Definitely not a hip crowd by any metric. I'm not sure how a 101-year-old held public office within the last three decades. In any case, we have a diversely aged assortment of representatives here, as the following boxplot and histogram demonstrate. I'd argue the distribution leans a little too far to the older side.

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15,8))

df["age"].plot.box(
    color=dict(medians='r'),
    widths=0.5,
    boxprops=dict(linestyle='-', linewidth=2),
    flierprops=dict(linestyle='-', linewidth=2),
    medianprops=dict(linestyle='-', linewidth=2.5),
    whiskerprops=dict(linestyle='-', linewidth=2),
    capprops=dict(linestyle='-', linewidth=2),
    grid=True,
    ax=ax1)
ax1.set_ylabel("age")
ax1.title.set_text("Age Boxplot")


ax2.hist(df["age"],color="orange",alpha=0.8)
ax2.set_ylabel("Count")
ax2.set_xlabel("age")
ax2.title.set_text("Age Histogram")
ax2.grid()

fig.tight_layout()

The results of these plots are a little scary, regardless of the idiom that "with age comes wisdom." Considering how difficult it has been to find data on congressional representatives, it will be interesting to see how age impacts a representative's polarity, especially with such a spread of data. The histogram on the right shows a right skew, and that tail is far longer than I think it should be.

New York Case Study

To illustrate the purpose of ideological scores and dive a little deeper into our representative data, let's visit New York. New York is a heavily liberal state, ideologically speaking, mostly due to New York City. But upstate there are a good number of conservative voters, and we can see this reflected in the ideological plot below.

ny_116 = df.loc[(df["congress"] == 116) & (df["state_name"] == "New York")].groupby(["representative"], as_index=False)[["congress","nominate_dim1", "nominate_dim2", "party"]].agg(
    nominate_dim1_mean=('nominate_dim1', 'mean'),
    nominate_dim2_mean=('nominate_dim2', 'mean'),
    congress=('congress', 'first'),
    party=('party', 'first')
).sort_values("congress")

colors = {
    "Republican Party": "red",
    "Democratic Party": "blue"
}

party = {
    "Republican Party": "R",
    "Democratic Party": "D"
}

fig, ax = plt.subplots(figsize=(15, 8))

ax.scatter(x=ny_116['nominate_dim1_mean'], y=ny_116['nominate_dim2_mean'], c=ny_116['party'].map(colors), s=100)
ax.set_ylabel("nominate_dim2_mean")
ax.set_xlabel("nominate_dim1_mean")

delta = 0.005
for idx, row in ny_116.iterrows():
    ax.annotate(row['representative'], (row['nominate_dim1_mean']+delta, row['nominate_dim2_mean']+delta), fontsize=12)

fig.tight_layout()
# fig.savefig(f"ny_nominate_scores.png", dpi=96)

With this plot, the idea of NOMINATE scores becomes pretty obvious. There are fairly definite clusters between the parties in both dimensions, with the NOMINATE 1st dimension denoting the liberal-conservative axis. As demonstrated by the graph, negative is liberal, and positive is conservative. As mentioned before, we will be exclusively focusing on the 1st dimension.

There is another significant observation to be made here: political ideology is heavily polarized within a state. This means that our plan to use state demographics will fail to account for the diversity of political ideologies within a state at the district level. This could result in some complications in our model, but it is a concession we will have to make due to lack of data.

State Observations

We've seen that political ideology has significant variance within a state. But how do those patterns show up on a national level? To investigate, we will use choropleths utilizing state aggregates to understand the trends we have already observed, and draw conclusions about what a state says about a particular representative.

from mpl_toolkits.axes_grid1 import make_axes_locatable
import matplotlib as mpl
import geopandas  # needed for reading the shapefile below

# Get geographical data for states from local geodata file:
states_geodata = geopandas.read_file('fresh_data/geodata/usa-states-census-2014.shp')

def build_choropleth(column="nominate_dim1", cmap="RdBu_r", all_time=False, manual_table=False, table=None, halfrange=None):
    """
    Builds choropleth according to specifications. 
    By default, generates polarity breakdown by state for 116th session of congress. 
    Params:
        - column: the column being aggregated on a state basis
        - cmap: cmap to be used for state coloration
        - all_time: flag denoting whether chart concerns all time, or a particular time period
        - manual_table: (sloppy) flag denoting use of the passed 'table' parameter instead of the global 'df' table
        - table: (sloppy) manual table to be used in place of the global 'df' table
        - halfrange: param denoting the cmap centering norm to be used
    """

    # Get subset of states for 116th congress from global 'df' when no table is provided:
    if not manual_table:
        subset_116 = df[df["congress"]==116].groupby(["state_name"], as_index=False)[column].mean().sort_values(by=column)
        table = pd.merge(
            subset_116,
            states_geodata,
            left_on="state_name",
            right_on="NAME",
            how="left"
        )
    else:
        table = pd.merge(
            table,
            states_geodata,
            left_on="state_name",
            right_on="NAME",
            how="left"
        )
    # Build GeoDataFrame from current df:
    geo_df = geopandas.GeoDataFrame(table, geometry=table["geometry"])

    # Remove Hawaii and Alaska for whom we do not have geodata:
    hawaii_alaska_indices = geo_df[(geo_df["state_name"] == "Hawaii") | (geo_df["state_name"] == "Alaska")].index
    geo_df = geo_df.drop(hawaii_alaska_indices)


    # Normalize cmap from data:
    if halfrange:
        norm = mpl.colors.CenteredNorm(halfrange=halfrange)
    else:
        norm = None

    # Plot
    fig = plt.figure(1, figsize=(25,15));
    ax = fig.add_subplot();

    # Legend tweaks:
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=-1.8)
    ax.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False) 
    ax.tick_params(axis='y', which='both', left=False, right=False, labelleft=False) 
    ax.set_title(f"Average {column} by State{', 2019-2021' if not all_time else ''}")

    ax.set_frame_on(False)
    geo_df.apply(lambda x: ax.annotate(text=x.NAME, xy=x.geometry.centroid.coords[0], ha='center', fontsize=14),axis=1);
    us_map = geo_df.plot(
        ax=ax, 
        cmap=cmap, 
        norm=norm,
        figsize=(15,15), 
        column=column, 
        legend=True, 
        legend_kwds={"label": f"Mean {column}", "orientation": "vertical"},
        cax=cax,
    );

    return us_map.get_figure()

Average NOMINATE by State

To examine the breakdown of ideology on a state level, we'll focus on the 116th session of congress. Averaging all of the NOMINATE scores for this session on a state level allows us to color-code our choropleth appropriately, with a blue-red color mapping, blue being liberal and red being conservative:

fig = build_choropleth(halfrange=0.5)
fig.tight_layout()

This plot demonstrates a pretty obvious conclusion for most Americans: red states are conservative and blue states are liberal. With that, we can already make assumptions about our predictive power on a state level: liberals in blue states will be predicted fairly well, and conservatives will be predicted well in red states. Unfortunately, this also means that we will likely suffer predictively when a representative's ideology doesn't line up with their state's average scores.

Let's dive deeper into these metrics.

Signed Democratic Party Change

To examine the impact of party, we will estimate the signed difference between the average NOMINATE scores of the 101st (1989) and 116th (2019) congressional sessions. Negative average changes denote a liberal polarity shift, and positive average changes denote a conservative polarity shift.

df["nominate_dim1_difference"] = df.groupby("state_name")["nominate_dim1"].transform(np.ptp)  # per-state range (max - min), kept for reference; not used by get_party_map below

def get_party_map(party):
    print(f"Loading data for {party}:")
    party_df = df[df['party'] == party]
    state_ideology_change = pd.DataFrame(columns=["state_name", f"{party} Ideology Change 1989-2021"])
    for state in party_df["state_name"].unique():
        state_df = party_df[party_df["state_name"]==state][["congress", "nominate_dim1"]]
        congresses_avg_dim1 = state_df.groupby("congress", as_index=False)["nominate_dim1"].mean()

        congress_1 = congresses_avg_dim1[congresses_avg_dim1["congress"]==congresses_avg_dim1["congress"].min()]["nominate_dim1"].mean()
        congress_2 = congresses_avg_dim1[congresses_avg_dim1["congress"]==congresses_avg_dim1["congress"].max()]["nominate_dim1"].mean()

        # Signed change: positive denotes a conservative shift, negative a liberal one.
        signed_difference = congress_2 - congress_1

        d={'state_name': state, f"{party} Ideology Change 1989-2021": signed_difference}
        state_ideology_change.loc[len(state_ideology_change)]=d

    fig = (build_choropleth(column=f"{party} Ideology Change 1989-2021", cmap="RdBu_r", all_time=True, manual_table=True, table=state_ideology_change, halfrange=1.0))
    return fig
fig = get_party_map("Democratic Party")
fig.tight_layout()
fig.savefig(f"democrat_ideology_difference.png", dpi=96)
Loading data for Democratic Party:

The graph gives us a little insight into the polarity we examined in our first few data explorations. As this graph only includes Democrats, we can see that Democratic representatives tend to get more liberal, with the exceptions being the red states from our first choropleth. This tells us that party lines do not rule ideological shifts; Democrats get more liberal or more conservative based on their home state. This means that regardless of political affiliation, representatives from red states are more conservative, and representatives from blue states are more liberal. But it also tells us that these states get increasingly entrenched in their dominant ideology, another indicator of increasing polarity.

Signed Republican Party Change

Computing the same for the Republican Party, we can see the same relationship:

fig = get_party_map("Republican Party");
fig.tight_layout()
fig.savefig(f"republican_ideology_difference.png", dpi=96)
Loading data for Republican Party:

We can observe the same relationship for Republicans, with a serious shift towards conservatism in South Dakota.

Correlation Heatmap

To evaluate our constituent data, let's focus on a few key features of each state's population and see how they correlate with our target, nominate_dim1. To do so, we'll use a correlation heatmap to visually depict relationships within our data. Ideally, we will be able to identify some patterns and make some decisions on feature selection. This enhances our understanding of variable interactions and gives us an idea of what direction we should take the model.

import seaborn as sns
import matplotlib

matplotlib.rcParams.update({'font.size': 12})

correlation_specifics = df[["nominate_dim1", "believe_in_god_absolutely_certain", "do_not_believe_in_god", "other_dont_know_if_they_believe_in_god", \
    "white", "black", "contributions_from_pacs", "contributions_from_individuals", "cash_on_hand", "debts", "total_poverty", "age", "receipts"]].corr()


fig, ax = plt.subplots(figsize=(15,10))   
ax =  sns.heatmap(correlation_specifics, vmin=-1, vmax=1, annot=True)
# Give a title to the heatmap. Pad defines the distance of the title from the top of the heatmap.
ax.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);

matplotlib.rcParams.update({'font.size': 22})

There are no strong correlations, but hopefully our use of these metrics in tandem will aid in our modeling efforts. Some standout results:

  • believe_in_god_absolutely_certain has the highest positive correlation with our NOMINATE scores; we can assume this could help us in modeling
  • personal finance data has a very small correlation with our NOMINATE scores, so our predictions will likely suffer on a district level, as anticipated
  • age has a relatively high correlation with NOMINATE score; this could also be helpful

Single-Feature Regression

Using our highest correlation from the heatmap above, we will build a quick predictor to estimate nominate_dim1 without any other features:

from sklearn.linear_model import LinearRegression

# df.plot.scatter(x="believe_in_god_absolutely_certain", y="nominate_dim1")

holy_df = df.groupby("believe_in_god_absolutely_certain", as_index=False)["nominate_dim1"].mean()

fig, ax = plt.subplots(figsize=(15,8))

ax.grid()
ax.scatter(x=holy_df["believe_in_god_absolutely_certain"], y=holy_df["nominate_dim1"])
ax.set_ylabel("nominate_dim1")
ax.set_xlabel("believe_in_god_absolutely_certain")
ax.set_title("Simple Regression on 'believe_in_god' Feature")

lr = LinearRegression()
lr.fit(np.array(holy_df["believe_in_god_absolutely_certain"]).reshape(-1,1), np.array(holy_df["nominate_dim1"]).reshape(-1,1))

# Predict the fitted line across the observed belief values:
x_vals = np.array(holy_df["believe_in_god_absolutely_certain"]).reshape(-1, 1)
regression = lr.predict(x_vals)
ax.plot(holy_df["believe_in_god_absolutely_certain"], regression, color="red", linewidth=3);

fig.tight_layout()
# fig.savefig(f"holy_regression_batman.png", dpi=96)

Machine Learning:


Data Preparation

We have observed some correlations and possible predictive power in our EDA, but for these first few predictive attempts, we'll be using most of our data. For the most part, the only excluded fields are metadata from merges, and unrelated fields like the second nominate dimension.

Additionally, some final cleaning is required to get our data into the appropriate shape for model use.

Our models will aim to address two predictive tasks:

  • Predicting DW-NOMINATE Scores
    • Nationally
    • State Level
  • Predicting Representative Traits
    • Age
    • Constituent Belief in God
    • State/District
    • Debts

We will then take what our best performing models learn and extrapolate further patterns in the data.

Refactor Party

We are mostly interested in Republican/Democrat representatives, so we can encode party as a binary indicator, where Republican is 0 and Democrat is 1:

# Remove smaller parties or lack of party affiliation:
df = df[(df["party"] == "Republican Party") | (df["party"] == "Democratic Party")]

df["party"] = df["party"].apply(lambda x: 1 if "Democratic" in x else 0)

Refactor Districts

Our districts are integer values which, from a top-down perspective, could appear to have relationships with each other. This can impede model learning: the 5th California district has nothing to do with the 5th Wisconsin district. Districts must be understood within the context of their state, and this information is not presently captured by our data. To properly learn from districts, we will need to combine our state and district columns, and only then can we one-hot encode. This will increase the dimensionality of our data significantly, but I believe it will aid model learning.

df["district_code"] = df["state_name"].apply(lambda x: x.lower()) + "_district_" + df["district_code"].astype(str)
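As a minimal illustration with made-up values, the combined codes now one-hot encode into distinct columns per state/district pair, so two different states' 5th districts can never be conflated:

```python
import pandas as pd

# Two districts that share the number 5 but belong to different states:
codes = pd.Series(["california_district_5", "wisconsin_district_5"])

# One-hot encoding yields one indicator column per state/district pair.
dummies = pd.get_dummies(codes)
```

Each row activates exactly one column, which is the behavior we want from the combined encoding.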

Reshaping/Refactoring Data

We will use a functional implementation so that we can tweak what our model learns from down the line, and further investigate feature importance as well as necessity. The functional implementation accomplishes the following automatically:

Refactor Int/Float: TensorFlow requires float32, a minor refactor you can see in the code below.

Categorical One-Hot-Encoding: In order for our models to appropriately interpret and utilize our categorical features, we will need to one-hot encode each of them.

Train/Test Split: I opted for a 90-10 train/test split arbitrarily, as our dataset is large enough that a 10% test set seems adequate for our purposes.

from sklearn.model_selection import train_test_split

# Default columns for learning (most of them):
columns_X = [
       ## representative data: 
       'state_name', 'district_code', 'party',
       'congress', 'year_range', 'born', 'age',
       'nominate_number_of_votes', 'running_as',
       'receipts', 'contributions_from_individuals', 'contributions_from_pacs',
       'contributions_and_loans_from_candidate', 'disbursements',
       'cash_on_hand', 'debts', 
       
       ## state data: 
       'total_poverty', 'white', 'black', 'hispanic', 'asian', 'multiple_races',

       # State data across years (only most recent data available):
       'believe_in_god_absolutely_certain', 'believe_in_god_fairly_certain',
       'believe_in_god_not_too_not_at_all_certain', 'believe_in_god_dont_know',
       'do_not_believe_in_god',

       'buddhist', 'catholic', 'evangelical_protestant', 'hindu',
       'historically_black_protestant', 'jehovahs_witness', 'jewish',
       'mainline_protestant', 'mormon', 'muslim', 'orthodox_christian',
       'unaffiliated_religious_nones', 
       
       # State data from the decennial census (2020, 2010, 2000) as closest-match
       'population'
]

def prepare_data(columns_X=columns_X, columns_Y=['nominate_dim1', 'nominate_dim2']):

       y = df[columns_Y[0]].astype('float32') # Only target the first dimension
       X = df[columns_X].copy()               # Copy so in-place casts don't touch df

       # Treat integer columns as categories so they are one-hot encoded below:
       int_cols = list(X.select_dtypes(include=[int]))
       X.loc[:, int_cols] = X[int_cols].astype("category")

       # TensorFlow requires float32:
       float64_cols = list(X.select_dtypes(include='float64'))
       X.loc[:, float64_cols] = X[float64_cols].astype('float32')

       # One-hot encode every object/category column:
       cat_cols = X.select_dtypes(include=[object, "category"]).columns
       X = pd.get_dummies(X, prefix=cat_cols, columns=cat_cols)

       return train_test_split(X, y, test_size=0.1)
       
x_train, x_test, y_train, y_test = prepare_data()

Predicting DW-NOMINATE Nationally

Model Selection

My approach to model selection is to do some surface-level fitting on the training data with an assortment of models. I will select the highest default performers and tune their hyperparameters with grid searches to find the optimal model for this problem. We will repeat this process for each of our two predictive tasks.

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def surface_level_test(model):
    model.fit(x_train,y_train)
    return round(model.score(x_train, y_train), 2), round(model.score(x_test, y_test), 2)

models = [
    ("LR", LinearRegression()),
    ("KNR", KNeighborsRegressor()),
    ("MLP", MLPRegressor()),
    ("ABR", AdaBoostRegressor()),
    ("GBR", GradientBoostingRegressor()),
    ("RFR", RandomForestRegressor())
]

for name, model in models:
    train, test = surface_level_test(model)
    print(f"- {name}: test: {test}")
- LR: test: 0.85
- KNR: test: 0.07
- MLP: test: -191208490.41
- ABR: test: 0.81
- GBR: test: 0.86
- RFR: test: 0.9

Surface-level training of these algorithms yielded the following test scores (recorded from an earlier run, so some values differ slightly from the cell output above):

  • LinearRegression: 0.79
  • TF-DNN (using MAE): 0.40 (not pictured above)
  • KNNRegression: 0.13
  • SKL-MLP: -2.3828207429 × 10^8 (I must have messed up training here)
  • AdaBoost: 0.89
  • GradientBoosting: 0.86
  • RandomForestRegression: 0.91

Considering the number of tests made, I opted to focus on my best-performing model and remove the unused models from this notebook.

Understanding scores:

RandomForestRegressor scores are the "coefficient of determination", which is defined as one minus the residual sum of squares divided by the total sum of squares. 1.0 is the best possible score, but scores can be negative (the model can be arbitrarily worse). The score is easily interpretable, giving the fraction of variance in the data explained by the model.

*RandomForestRegressor.score
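To make the score concrete, here it is computed by hand on toy numbers (not our data) and checked against sklearn:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions
y_true = np.array([0.5, -0.3, 0.1, -0.7])
y_pred = np.array([0.4, -0.2, 0.0, -0.6])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

# Matches sklearn's r2_score, which estimator .score() calls under the hood
assert np.isclose(r2, r2_score(y_true, y_pred))
print(round(r2, 4))  # 0.95
```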

In order to tune the hyperparameters of our chosen model, I opted to use GridSearchCV, which will find the best performing combinations of specified hyperparameters to better fit the model on the data. The function below allows for a custom estimator and parameters to be utilized.

Additionally, the search has been parameterized with 5-fold cross validation for each test. I stuck with the default 5 as I wanted to reduce training time while maximizing use of my training data.

from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

def grid_search(estimator, params):
    grid = GridSearchCV(estimator=estimator, param_grid=params, cv=5, n_jobs=-1, verbose=10)
    grid.fit(x_train, y_train)

    print()
    print("\n Results from Grid Search " )
    print("\n The best estimator across all searched params:\n",grid.best_estimator_)
    print("\n The best score across all searched params:\n",grid.best_score_)
    print("\n The best parameters across all searched params:\n",grid.best_params_)

    return grid

Benchmarking

Below is a benchmarking function I wrote to quickly summarize the default scoring of a model and output the top feature importances. This will be handy when we run different tests on models and want to understand how they are progressing.

# Outputs relevant information about a model including default score and feature importances as a dataframe:
def benchmark_model(model):

    try:
        feature_importances = model.feature_importances_
    except AttributeError:
        feature_importances = model.coef_  # linear models expose coefficients instead

    feature_importances = pd.DataFrame({'importance': feature_importances}, index=x_train.columns).sort_values(by='importance', ascending=False)
    feature_importances["importance"] = feature_importances["importance"].apply(round, args=(4,))

    test_score = model.score(x_test, y_test)
    train_score = model.score(x_train, y_train)

    print(f"Train Score : { train_score }")
    print(f"Test Score  : { test_score }")
    display(feature_importances.head(20))
    return feature_importances.head(20)

RandomForestRegressor

Before tuning hyperparameters, let's take a preliminary look at our feature importances, first from a linear baseline and then from the default random forest:

lr = LinearRegression().fit(x_train, y_train)

benchmark_model(lr);
Train Score : 0.8829188371230571
Test Score  : 0.8523750341200649
importance
multiple_races 4.6133
white 0.9624
black 0.7884
state_name_Hawaii 0.6997
state_name_California 0.3729
district_code_hawaii_district_1 0.3579
district_code_hawaii_district_2 0.3418
district_code_texas_district_36 0.3149
state_name_New Mexico 0.3001
district_code_arizona_district_6 0.2916
party_0 0.2915
district_code_california_district_48 0.2850
state_name_Texas 0.2837
state_name_Nevada 0.2821
state_name_Arizona 0.2752
district_code_wisconsin_district_9 0.2741
district_code_texas_district_14 0.2689
district_code_california_district_4 0.2683
district_code_texas_district_3 0.2663
district_code_ohio_district_8 0.2546
default_model = RandomForestRegressor(random_state=42).fit(x_train, y_train)

benchmark_model(default_model);
Train Score : 0.9877739224511929
Test Score  : 0.9012254786181271
importance
party_0 0.4390
party_1 0.3445
running_as_Incumbent 0.0273
born 0.0156
disbursements 0.0114
age 0.0100
contributions_from_pacs 0.0091
contributions_from_individuals 0.0084
believe_in_god_absolutely_certain 0.0075
cash_on_hand 0.0073
black 0.0066
asian 0.0062
nominate_number_of_votes 0.0061
total_poverty 0.0049
multiple_races 0.0046
hispanic 0.0046
receipts 0.0039
believe_in_god_fairly_certain 0.0037
white 0.0037
do_not_believe_in_god 0.0036

We can observe here that party is disproportionately important compared to the other top features. This makes sense given our EDA, where party is a major indicator of ideology, but I was expecting heavier importance on location, where we have a large amount of historical data to demonstrate how a representative might lean. This means that the polarization of representatives at the district level prevents us from using state data effectively, as I had feared earlier. The model tries to make up the difference with representative data, which has only a sparse correlation with our target feature, and ends up relying on party to determine a member's ideology.

These results are not bad: explaining ~90% of our test set's variance is still a success. The model does appear to be overfitting, but some hyperparameter tuning and data modification might alleviate this in later tests. The reliance on party should be limited; I am interested in how well we can predict without it. To see whether my assumptions about the model's learning behavior in the face of state-level polarization variance are accurate, let's run another test.
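One quick way to probe this kind of overfitting (a sketch on synthetic stand-in data, since the real split lives in `prepare_data` above) is to cap tree depth and watch the train/test gap shrink as the forest is regularized:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our features/target (illustrative only,
# not the real representative dataset)
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=42)

for depth in [None, 10, 5]:
    model = RandomForestRegressor(max_depth=depth, random_state=42)
    model.fit(x_tr, y_tr)
    # A large train/test gap signals overfitting
    gap = model.score(x_tr, y_tr) - model.score(x_te, y_te)
    print(f"max_depth={depth}: train/test gap = {gap:.3f}")
```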


Training with State Data

Below is a subset of our data which includes exclusively state data. If my hypothesis is accurate, we will see a much lower score due to our reliance on representative data to understand polarity within a state.

test_columns = [
    'state_name',  'district_code',

    ## state data: 
    'total_poverty', 'white', 'black', 'hispanic', 'asian', 'multiple_races',

    # State data across years (only most recent data available):
    'believe_in_god_absolutely_certain', 'believe_in_god_fairly_certain',
    'believe_in_god_not_too_not_at_all_certain', 'believe_in_god_dont_know',
    'do_not_believe_in_god',

    'buddhist', 'catholic', 'evangelical_protestant', 'hindu',
    'historically_black_protestant', 'jehovahs_witness', 'jewish',
    'mainline_protestant', 'mormon', 'muslim', 'orthodox_christian',
    'unaffiliated_religious_nones', 
    
    # State data from the decennial census (2020, 2010, 2000) as closest-match
    'population'
]

x_train, x_test, y_train, y_test = prepare_data(columns_X=test_columns)
default_model = RandomForestRegressor(random_state=42).fit(x_train, y_train)

benchmark_model(default_model);
Train Score : 0.8384893234680975
Test Score  : 0.6581597603841642
importance
believe_in_god_absolutely_certain 0.0762
asian 0.0368
hispanic 0.0335
multiple_races 0.0323
white 0.0292
total_poverty 0.0288
population 0.0203
do_not_believe_in_god 0.0196
black 0.0192
district_code_georgia_district_5 0.0099
district_code_north carolina_district_12 0.0069
believe_in_god_fairly_certain 0.0069
district_code_texas_district_18 0.0069
district_code_texas_district_30 0.0066
district_code_mississippi_district_2 0.0066
district_code_louisiana_district_2 0.0064
district_code_missouri_district_1 0.0064
district_code_texas_district_20 0.0062
district_code_ohio_district_11 0.0061
district_code_georgia_district_4 0.0060

Evidently my hypothesis is accurate: without data on the representative, the model is crippled. Interestingly, the highest-correlation feature from our earlier EDA (believe_in_god_absolutely_certain) is the most important feature in this space.

Consideration of the entire feature space is important given the impact of constituents on their representatives. But without district-level data, it appears that some of our state data is not aiding the modeling process. To evaluate, we'll need another test:


Training with Representative Data

Below is a subset of our data which includes exclusively representative data, as well as state data. If my hypothesis is accurate, restoring the representative fields should recover most of the predictive power we saw in the full model.

test_columns = [
    ## representative data: 
    'state_name', 'district_code', 'party',
    'congress', 'year_range', 'born', 'age',
    'nominate_number_of_votes', 'running_as',
    'receipts', 'contributions_from_individuals', 'contributions_from_pacs',
    'contributions_and_loans_from_candidate', 'disbursements',
    'cash_on_hand', 'debts', 
    
    ## state data: 
    'total_poverty', 'white', 'black', 'hispanic', 'asian', 'multiple_races',

    # State data across years (only most recent data available):
    'believe_in_god_absolutely_certain', 'believe_in_god_fairly_certain',
    'believe_in_god_not_too_not_at_all_certain', 'believe_in_god_dont_know',
    'do_not_believe_in_god',

    'buddhist', 'catholic', 'evangelical_protestant', 'hindu',
    'historically_black_protestant', 'jehovahs_witness', 'jewish',
    'mainline_protestant', 'mormon', 'muslim', 'orthodox_christian',
    'unaffiliated_religious_nones', 
    
    # State data from the decennial census (2020, 2010, 2000) as closest-match
    'population'
]

x_train, x_test, y_train, y_test = prepare_data(columns_X=test_columns)
default_model = RandomForestRegressor(random_state=42).fit(x_train, y_train)

benchmark_model(default_model);
from sklearn.ensemble import RandomForestRegressor

rfr_params = { 
    'n_estimators': [25, 50, 100, 150], 
    'max_features': ['sqrt', 'log2', None], 
    'max_depth': [3, 6, 9], 
    'max_leaf_nodes': [3, 6, 9], 
}

grid_rfr = grid_search(RandomForestRegressor(), rfr_params)
 

GradientBoostingRegressor

I have arbitrarily chosen the following hyperparameters to be tested in a unique instantiation of a GradientBoostingRegressor (GBR). As a warning, the following code takes around 25 minutes to run.

test_columns = [
    'state_name', 

    ## state data: 
    'total_poverty', 'white', 'black', 'hispanic', 'asian', 'multiple_races',

    # State data across years (only most recent data available):
    'believe_in_god_absolutely_certain', 'believe_in_god_fairly_certain',
    'believe_in_god_not_too_not_at_all_certain', 'believe_in_god_dont_know',
    'do_not_believe_in_god',

    'buddhist', 'catholic', 'evangelical_protestant', 'hindu',
    'historically_black_protestant', 'jehovahs_witness', 'jewish',
    'mainline_protestant', 'mormon', 'muslim', 'orthodox_christian',
    'unaffiliated_religious_nones', 
    
    # State data from the decennial census (2020, 2010, 2000) as closest-match
    'population'
]

x_train, x_test, y_train, y_test = prepare_data(columns_X=test_columns)
# Outputs relevant information about a model including default score and feature importances as a dataframe:
def benchmark_model(model):

    feature_importances = pd.DataFrame({'importance': model.feature_importances_}, index=x_train.columns).sort_values(by='importance', ascending=False)
    feature_importances["importance"] = feature_importances["importance"].apply(round, args=(4,))

    test_score = model.score(x_test, y_test)
    train_score = model.score(x_train, y_train)

    print(f"Train Score : { train_score }")
    print(f"Test Score  : { test_score }")
    display(feature_importances)
    return feature_importances.head(20)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=0)

model.fit(x_train, y_train)

benchmark_model(model);
Test Score  : 0.9104885032679291
Train Score : 0.9874107098657662
importance
party 0.7830
running_as_Incumbent 0.0275
born 0.0169
contributions_from_pacs 0.0116
age 0.0115
... ...
state_name_Montana 0.0000
state_name_Hawaii 0.0000
state_name_North Dakota 0.0000
state_name_Vermont 0.0000
state_name_Alaska 0.0000

173 rows × 1 columns

importance
party 0.7830
running_as_Incumbent 0.0275
born 0.0169
contributions_from_pacs 0.0116
age 0.0115
disbursements 0.0105
cash_on_hand 0.0078
contributions_from_individuals 0.0078
nominate_number_of_votes 0.0068
black 0.0061
total_poverty 0.0060
believe_in_god_absolutely_certain 0.0060
believe_in_god_fairly_certain 0.0055
multiple_races 0.0054
receipts 0.0050
asian 0.0049
hispanic 0.0045
contributions_and_loans_from_candidate 0.0044
debts 0.0038
do_not_believe_in_god 0.0037
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(n_estimators=50, random_state=0)

model.fit(x_train, y_train)
print(model.score(x_test, y_test))

output_importances(model.feature_importances_)

model.estimator_weights_
0.13503505585450315
array([0.57892607, 0.65830094, 0.33057098, 0.31857633, 0.30836485,
       0.12349047, 0.36056242, 0.27059642, 0.1697619 , 0.32980739,
       0.26745244, 0.08591508, 0.26265596, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])
from sklearn.ensemble import AdaBoostRegressor

abr_parameters = {
    # Parameters of the underlying DecisionTreeRegressor use the
    # double-underscore syntax:
    'estimator__max_depth': [3, 5, 7],
    # 'estimator__min_samples_leaf': [5, 10],
    'n_estimators': [10, 50, 250],
    'learning_rate': [0.01, 0.1]
}

grid_ABR = grid_search(AdaBoostRegressor(), abr_parameters)
Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5; 1/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=10.
[CV 3/5; 1/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=10.
[CV 2/5; 1/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=10.
[CV 4/5; 1/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=10.
[CV 1/5; 2/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=50.
[CV 5/5; 1/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=10.
[CV 2/5; 2/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=50.
[CV 4/5; 2/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=50.
[CV 3/5; 3/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=250
[CV 4/5; 3/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=250
[CV 3/5; 2/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=50.
[CV 5/5; 3/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=250
[CV 1/5; 4/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=10..
[CV 2/5; 4/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=10..
[CV 3/5; 4/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=10..
[CV 4/5; 4/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=10..
[CV 5/5; 2/18] START estimator_max_depth=3, learning_rate=0.01, n_estimators=50.
[CV 5/5; 4/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=10..
[CV 1/5; 5/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=50..
[CV 2/5; 5/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=50..
[CV 3/5; 5/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=50..
[CV 4/5; 5/18] START estimator_max_depth=3, learning_rate=0.1, n_estimators=50..
gbr_parameters = {
    # GridSearchCV expects a list of candidate values per parameter:
    'learning_rate': [0.01, 0.02, 0.03, 0.04],
    'subsample'    : [0.9, 0.5, 0.2, 0.1],
    'n_estimators' : [500], # 750, 1000, 1500
    'max_depth'    : [4, 6, 8, 10],
}

grid_GBR = grid_search(GradientBoostingRegressor(), gbr_parameters)
Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV 1/5; 2/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.5
[CV 1/5; 1/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.9
[CV 4/5; 2/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.5
[CV 2/5; 1/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.9
[CV 3/5; 1/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.9
[CV 4/5; 1/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.9
[CV 2/5; 2/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.5
[CV 5/5; 2/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.5
[CV 2/5; 3/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.2
[CV 5/5; 1/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.9
[CV 1/5; 3/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.2
[CV 3/5; 2/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.5
[CV 4/5; 3/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.2
[CV 3/5; 3/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.2
[CV 5/5; 3/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.2
[CV 1/5; 4/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.1
[CV 2/5; 4/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.1
[CV 3/5; 4/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.1
[CV 4/5; 4/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.1
[CV 5/5; 4/32] START estimator=None, learning_rate=0.01, n_estimators=5, subsample=0.1
[CV 1/5; 5/32] START estimator=None, learning_rate=0.02, n_estimators=5, subsample=0.9
[CV 2/5; 5/32] START estimator=None, learning_rate=0.02, n_estimators=5, subsample=0.9
[CV 3/5; 5/32] START estimator=None, learning_rate=0.02, n_estimators=5, subsample=0.9
[CV 4/5; 5/32] START estimator=None, learning_rate=0.02, n_estimators=5, subsample=0.9

NOMINATE Predictor Final Model

The grid search yielded the following model, which upon fitting (which takes approximately 1.5 minutes) gives us the following scores:

final_model = GradientBoostingRegressor(learning_rate=0.04, max_depth=10, n_estimators=500,subsample=0.9)
final_model.fit(x_train,y_train)
GradientBoostingRegressor(learning_rate=0.04, max_depth=10, n_estimators=500,
                          subsample=0.9)
print("Train:",final_model.score(x_train, y_train))
print("Test :", final_model.score(x_test, y_test))
Train: 0.9902131502094773
Test : 0.7805524891099352
output_importances(final_model.feature_importances_)

Evidently our model is overfitting quite a bit...
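One possible mitigation (a sketch on synthetic stand-in data, not part of the pipeline above) is gradient boosting's built-in early stopping: it holds out a validation slice and stops adding trees once the validation score stops improving, rather than fitting all 500 estimators:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only, not the real dataset)
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

gbr = GradientBoostingRegressor(
    learning_rate=0.04,
    max_depth=10,
    n_estimators=500,
    subsample=0.9,
    validation_fraction=0.1,  # held-out slice used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gbr.fit(x_tr, y_tr)
print("trees actually fit:", gbr.n_estimators_)
print("test score:", round(gbr.score(x_te, y_te), 3))
```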

from sklearn import metrics

y_pred = final_model.predict(x_test)
y_true = y_test

print("Coefficient of Determination:", metrics.r2_score(y_true, y_pred)) # y_true comes first
print("MSE:", metrics.mean_squared_error(y_true, y_pred))
print("MAPE:", metrics.mean_absolute_percentage_error(y_true, y_pred)) # sensitive to relative errors
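One caveat on MAPE (toy numbers below, not our data): it divides each error by the true value, and DW-NOMINATE scores cluster near zero, so a handful of centrist representatives can inflate it dramatically even when absolute errors are small:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Every prediction is off by exactly 0.1, but the near-zero target
# dominates the percentage error
y_true = np.array([0.5, -0.5, 0.01])
y_pred = np.array([0.4, -0.4, 0.11])

mape = mean_absolute_percentage_error(y_true, y_pred)
print(mape)  # ≈ 3.47, i.e. a 347% "error" despite uniform 0.1 misses
```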

Feature Importances

 

Sample prediction:

df.iloc[3000]
# example prediction:

y_pred = final_model.predict(x_test.iloc[[0]])[0]
y_true = y_test.iloc[0]

print("representative: KING, Peter T.")
print("republican, 106th session of congress")
print("\tpredicted:", y_pred)
print("\ttrue y   :", y_true)

Predicting DW-NOMINATE Statewide

Predicting Representative Traits Nationally

  • Age
  • Constituent Belief in God
  • State/District
  • Debts

Predicting Representative Traits Statewide